Week 3 - Statistics and Probability Thinking with R

There are some base R functions that make good tools for exploring how probability works and for visualizing the properties of our data sets. Now that we are familiar with how to construct the basic object types in R and how to subset them, we can use some other useful functions to gather descriptive statistics from a data set.

In Weeks 1 and 2, we learned how to construct the basic R object types and picked up a range of useful R functions along the way.

1.1 Introduction and Outline

The basic purpose of statistics is the collection, organization, and interpretation of data. These are interrelated concepts, so we will build our knowledge gradually in each domain, rather than attempt to master data collection (i.e. sampling and populations) before we advance to data organization.

Today, we’re working with a dataset that I generated, with counts of several plant taxa across a sequence of depths. This simulates a stratigraphic study. We need to state some assumptions at the outset.

  • We have identified a sample of the fossil population.
  • All of the identifications are correct.

1.2 Downloading R Packages

There is a ton of user-generated code out there that we can use on our own data. One way this code is distributed is through packages, which are hosted on CRAN. Package authors are trusted to maintain their packages, but this does not always happen, so be careful about how many dependencies you write into your code. For today’s lab, we need one of the functions from the “vioplot” package. Use the code below to check whether it is installed and, if not, to download it.

packages <- c("vioplot") #Make an object listing packages

install.packages(setdiff(packages, row.names(installed.packages()))) #Install any packages that are in the object, but not found on your machine.

#We can also simply load the vioplot library; this attaches it along with its dependencies.

library(vioplot)
## Loading required package: sm
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
#vioplot::vioplot() #This lets us call a function without loading the library and is how we will use it later.

1.3 Data Collection Concepts

We have already introduced a range of data types as we’ve become more familiar with R objects. It is helpful to have a working definition of major types of data in mind as we move forward.

Our primary interest is in measuring variation, and the categories within which we expect two or more states are defined as variables. A large proportion of paleoecology research is conducted by counting the number of different types of fossil organisms, making each taxon a different variable. These are fossil variables (this is my term!):

  • Identifications
  • Counts
  • Morphometrics
  • Categorical Assessments (binary or nominal scales)

The other set of variables we are interested in relates to the sites/strata from which we collect these data. These data may be synchronic, describing present conditions of the site, or diachronic, describing changes through time in parallel with the fossil variables. These are site variables (again, my term). We’ve described some of the variables we might derive from sites/strata:

  • Temperature (continuous numeric, interval scale)
  • Elevation (continuous numeric, interval scale)
  • pH (continuous numeric, log scale)
  • Sediment Composition (continuous numeric, interval then ratio scale)
  • Chemistry (e.g. dissolved O2).
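
These scale distinctions map directly onto R object classes. A minimal sketch (all values here are made up for illustration):

```r
# Toy vectors illustrating how the data types above map onto R classes.
taxon   <- factor(c("Poaceae", "Typha", "Poaceae"))  # nominal (categorical)
count   <- c(12L, 7L, 31L)                           # discrete numeric (counts)
temp_c  <- c(18.2, 19.5, 17.9)                       # continuous numeric, interval scale
ph      <- c(6.8, 7.1, 5.9)                          # continuous numeric, log scale

class(taxon)   # "factor"
class(count)   # "integer"
class(temp_c)  # "numeric"
```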

There are many kinds of data that we might collect, and many ways to arrange them. As we move forward, we will get a clearer sense of how data organization impacts subsequent analyses and interpretation.

2. Reading Data and Producing Descriptive Statistics

We will read the fake pollen data set from last week and use some functions to gather insights. We are taking for granted how many variables we are looking at here and how many samples we have collected. We will come back to populations and sampling a little later. For now, let’s concern ourselves with taking some basic measurements from our data set.

fake_core <- read.csv("data/Fake_core.csv",
         header = TRUE,
         row.names = "depth")

ncol(fake_core) #How many variables?
## [1] 9
nrow(fake_core) #How many samples?
## [1] 100
sum(fake_core) #This is the sum of all the values in the table.
## [1] 21481

2.1 Descriptive Statistics

Descriptive statistics can offer some initial insights into numeric variables. Let’s look at a handful of basic functions for deriving them, using one of the classes of fossils, “Cyperaceae undiff.”. These can be computed with base R functions, but we will also derive them manually to be sure we understand the measurements.

sum(fake_core$Cyperaceae.undiff.)
## [1] 2526
mean(fake_core$Cyperaceae.undiff.)
## [1] 25.26
sum(fake_core$Cyperaceae.undiff.)/length(fake_core$Cyperaceae.undiff.) #Mean is sum of observations divided by number of observations.
## [1] 25.26

The median is the middle value of the ranked values for a given variable: 50% of the observations fall above it and 50% below. Deriving it manually also teaches us a couple of useful base R functions (unique() and order()).

median(fake_core$Cyperaceae.undiff.)
## [1] 25
vls_cyp <- unique(fake_core$Cyperaceae.undiff.) #Unique lets us get all of the unique values from our variable.

vls_cyp
##  [1] 10 11  6 19  9 15 12 14 21 17  7 18 16 22 20 25 32 23 36 30 29 28 34 31 26
## [26] 35 24 33 38 39 37 27 41 46 42 40 45
ord_cyp <- vls_cyp[order(vls_cyp)] #order() returns the indices that would sort the vector, so subsetting with them puts the values in rank order.

ord_cyp
##  [1]  6  7  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## [26] 33 34 35 36 37 38 39 40 41 42 45 46
mdn_cyp <- ord_cyp[floor(length(ord_cyp)/2)] #Grab a middle value from the ordered list. There are differences for even and odd numbers of values, and a true median keeps duplicate values.

mdn_cyp
## [1] 25
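
Note that our manual route went through unique(), which drops duplicate counts, so it is not a general recipe for the median. A safer manual version sorts all of the values and handles even- and odd-length vectors separately (manual_median is my name for this sketch):

```r
# Manual median over ALL values (duplicates kept), mirroring median().
manual_median <- function(v) {
  s <- sort(v)                      # rank-order every observation
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd n: the single middle value
  } else {
    mean(s[c(n / 2, n / 2 + 1)])    # even n: average the two middle values
  }
}

manual_median(c(3, 1, 4, 1, 5))     # 3, same as median()
manual_median(c(3, 1, 4, 1, 5, 9))  # 3.5, same as median()
```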

2.2 Using Apply to Work Efficiently

We can use the apply() function to take another function (such as mean()) and apply it iteratively over a table of data. This makes our lives a great deal easier when working with tables of taxa. apply() looks for three arguments: a matrix-like object, a margin code (1 for rows, 2 for columns), and a function. You can substitute custom functions as well. Below, we apply sum, mean, and median to the entire fake core data set.

fake_sums <- apply(fake_core, 2, sum)

fake_mean <- apply(fake_core, 2, mean)

fake_median <- apply(fake_core, 2, median)
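
The third argument does not have to be a named base function; an anonymous function works too. A small sketch on a toy matrix (not the core data), computing the coefficient of variation for each column:

```r
# apply() with a custom (anonymous) function: coefficient of variation
# (sd / mean) for every column of a toy matrix.
m <- matrix(c(1, 2, 3,
              10, 20, 30), nrow = 3)  # two columns of made-up values

apply(m, 2, function(x) sd(x) / mean(x))  # 0.5 for both columns
```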

Some other functions also generate informative statistics for us. We can gather quartile estimates (the values below which 25%, 50%, and 75% of the observations fall) using summary(). summary() can also be used in the apply() format and yields a nice table if we save the result as an object.

summary(fake_core$Cyperaceae.undiff.)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   16.00   25.00   25.26   34.00   46.00
fake_smmy <- apply(fake_core, 2, summary)

fake_smmy
##         Cyperaceae.undiff. Typha Poaceae Alchornea Macaranga Celtis Blighia
## Min.                  6.00  2.00    2.00      3.00      1.00  12.00    3.00
## 1st Qu.              16.00  4.75   27.00     32.00     16.00  21.00    8.75
## Median               25.00  6.00   35.00     48.00     31.00  25.50   11.00
## Mean                 25.26  6.64   34.84     49.61     31.39  25.29   10.99
## 3rd Qu.              34.00  9.00   42.25     67.25     43.00  30.00   13.00
## Max.                 46.00 12.00   69.00    115.00     82.00  38.00   19.00
##         Sapotaceae.undiff. Guibourtia.demeusei
## Min.                  3.00                2.00
## 1st Qu.              10.00               11.00
## Median               16.00               15.50
## Mean                 15.44               15.35
## 3rd Qu.              20.00               19.25
## Max.                 29.00               28.00
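
The quartiles reported by summary() come from quantile(), which accepts any probabilities we like. Here is a quick sketch on a simple 0-to-100 vector rather than the core data:

```r
# quantile() generalizes summary(): request any cut points via probs.
v <- 0:100
quantile(v, probs = c(0.25, 0.5, 0.75))   # the 1st Qu., Median, and 3rd Qu.
quantile(v, probs = seq(0.1, 0.9, 0.2))   # arbitrary probabilities also work
```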

Using barplot(), we can visualize this information relatively quickly.

par(mai = c(1, 2, 0.5, 0.5))

barplot(fake_smmy, 
        beside = TRUE, 
        horiz = TRUE, 
        las = 2)

par(mar = c(5, 4, 4, 2) + 0.1)

These basic statistics show us a few things about our data. We can also plot the distribution of Cyperaceae undiff. counts across the 100 samples and show where the median and the mean fall along this scatter plot.

plot(fake_core$Cyperaceae.undiff.,
     100:1,
     xlim = c(0, 50),
     pch = 21)

lines(c(mean(fake_core$Cyperaceae.undiff.),
        mean(fake_core$Cyperaceae.undiff.)),
      c(0, 100),
      lty = 3)

lines(c(median(fake_core$Cyperaceae.undiff.),
        median(fake_core$Cyperaceae.undiff.)),
      c(0, 100),
      lty = 2)

3. Frequency Distributions

These counts are integers, so we could expect repetition of some numbers. Let’s see what the distribution of individual counts looks like using table(), which will produce a table of counts from a vector of data.
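
Before folding table() into a plot, here is what it returns on its own (a toy vector, not the core data):

```r
# table() tallies how many times each value occurs in a vector.
x <- c(3, 5, 5, 7, 7, 7)
table(x)  # value 3 occurs once, 5 twice, 7 three times
```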

plot(fake_core$Cyperaceae.undiff.,
     100:1,
     xlim = c(0, 50),
     pch = 19)

lines(table(fake_core$Cyperaceae.undiff.)*10, #Multiplying the counts by 10 so the frequency curve is visible at this scale.
      lty = 3,
      col = "darkred",
      lwd = 1) #The line width argument (lwd) lets us define the width of the lines we're plotting.

lines(c(mean(fake_core$Cyperaceae.undiff.),
        mean(fake_core$Cyperaceae.undiff.)),
      c(0, 100),
      lty = 1,
      lwd = 3)

lines(c(median(fake_core$Cyperaceae.undiff.),
        median(fake_core$Cyperaceae.undiff.)),
      c(0, 100),
      lty = 1,
      col = "blue",
      lwd = 3)

We can use this basic method to look at other species in our table. Let’s look at Alchornea.

#We could always copy-paste the code we just used, but if we catch ourselves copy-pasting, that means that we may be better off with a function.

plot(fake_core$Alchornea,
     100:1,
     xlim = c(0, max(fake_core$Alchornea)),
     pch = 19)

lines(table(fake_core$Alchornea)*10,
      lty = 3,
      col = "darkred",
      lwd = 1) #The line width argument (lwd) lets us define the width of the lines we're plotting.

lines(c(mean(fake_core$Alchornea),
        mean(fake_core$Alchornea)),
      c(0, 100),
      lty = 1,
      lwd = 3)

lines(c(median(fake_core$Alchornea),
        median(fake_core$Alchornea)),
      c(0, 100),
      lty = 1,
      col = "blue",
      lwd = 3)

3.1 Using Functions to Look at Multiple Variables

When looking for ways to build functions, look for repetition. Above, we repeat the variable (Alchornea) and the functions applied to it. So what if we wrote a function that took any table and ran all of these commands on each column?

cust_plot = function(x){
  par(mfrow = c(3, 3))
  title_names = colnames(x)
  for(i in 1:ncol(x)){
    plot(x[,i],
       1:length(x[,i]),
       xlim = c(0, max(x[,i])),
       pch = 19,
       xlab = "NISP",
       ylab = "sample")
  
  lines(rep(mean(x[,i]),2),
        c(0, length(x[,i])),
        lwd = 3)
  
  lines(rep(median(x[,i]),2),
        c(0, length(x[,i])),
        lwd = 3,
        col = "blue")
  
  lines((table(x[,i])/sum(table(x[,i])))*100,
        lty = 3,
        col = "darkred",
        lwd = 1)
  
  title(main = title_names[i])
  axis(4, at = seq(10,100,10), labels = seq(0.1,1,0.1))
  }
  par(mfrow = c(1, 1))
}

Now that we have a function that we can use, all we have to do is switch up the inputs. With this limited number of taxa, we can actually plot all of them simultaneously.

cust_plot(fake_core)

3.2 Histograms

Let’s find out what the overall picture of fossil identifications from this site looks like. In our imaginary dataset, fossil identifications are distributed evenly across a 2-meter sedimentary sequence. We totaled the observations with sum() earlier; now we want to know more about how they are distributed.

One of the first questions we might ask is about the frequency distribution of a given variable. We can get a sense of this quickly with the histogram function (“hist()”).

par(mfrow = c(2, 1))

plot(fake_core$Cyperaceae.undiff.,
     100:1,
     xlim = c(0, 50),
     pch = 19)

lines(c(mean(fake_core$Cyperaceae.undiff.),
        mean(fake_core$Cyperaceae.undiff.)),
      c(0, 100),
      lty = 3)

lines(c(median(fake_core$Cyperaceae.undiff.),
        median(fake_core$Cyperaceae.undiff.)),
      c(0, 100),
      lty = 2)

#barplot(table(fake_core$Cyperaceae.undiff.))

hist(fake_core$Cyperaceae.undiff., 
     breaks = 10,
     xlim = c(0,50))

par(mfrow = c(1, 1))

4. More Descriptive Statistics

Now that we have an idea of how we can view our counts as frequency distributions, we can explore some more ways to describe these results.

Base R has functions to derive all of these, but let’s also derive them manually so we understand what these statistics are telling us.

sd(fake_core$Cyperaceae.undiff.) #Calling a single column
## [1] 10.25495
fake_sds = apply(fake_core, 2, sd) #Using apply to get sds of every column

fake_sds
##  Cyperaceae.undiff.               Typha             Poaceae           Alchornea 
##           10.254952            2.630512           12.074449           24.551284 
##           Macaranga              Celtis             Blighia  Sapotaceae.undiff. 
##           17.936980            5.560130            3.412677            6.049159 
## Guibourtia.demeusei 
##            6.192819
apply(fake_core, 2, var)
##  Cyperaceae.undiff.               Typha             Poaceae           Alchornea 
##          105.164040            6.919596          145.792323          602.765556 
##           Macaranga              Celtis             Blighia  Sapotaceae.undiff. 
##          321.735253           30.915051           11.646364           36.592323 
## Guibourtia.demeusei 
##           38.351010

The standard deviation helps us characterize variation in either a sample or a population. It is derived from some related values that are worth knowing.

Below are some methods for describing deviation and variance.

mad(fake_core$Cyperaceae.undiff.)
## [1] 13.3434
mad_cyp = sum( abs(fake_core$Cyperaceae.undiff. - median(fake_core$Cyperaceae.undiff.) ) ) / length(fake_core$Cyperaceae.undiff.)

mad_cyp
## [1] 8.86

Above, we get different results from R and from the manual calculation because the mad() function multiplies by a constant (1.4826 by default) to adjust for “asymptotic normal consistency” (see help(mad)). Notice also that we can use either the mean or the median as the basis for absolute deviation; both the center and the scaling can be set with the “center =” and “constant =” arguments in mad(). In our manual calculation, we take the mean value of the absolute differences between the individual observations and their mean. This looks different from the formula above because I’ve subsumed the “sum()” and “/length(x)” steps by calling the entire calculation inside “mean()”. If we calculate the value manually using the median (second half of the code below), we get the same value as “mad()” with a constant of 1.

mad(fake_core$Cyperaceae.undiff., center = mean(fake_core$Cyperaceae.undiff.), constant = 1)
## [1] 8.74
mad_cyp = mean(abs(fake_core$Cyperaceae.undiff. - mean(fake_core$Cyperaceae.undiff.)))

mad_cyp
## [1] 8.8652
#Now we calculate the median absolute deviation.

mad_cyp = median(abs(fake_core$Cyperaceae.undiff - median(fake_core$Cyperaceae.undiff.)))

mad_cyp
## [1] 9
mad(fake_core$Cyperaceae.undiff., constant = 1)
## [1] 9

Here we are working with deviation from the mean or median. This is a logical enough way to express deviation in a set of numbers, but its applications and interpretation are limited (for example: how easily can we compare it between variables?). It does, however, set the foundation for variance, which is the sum of the squared deviations from the mean divided by the number of observations. For samples, we divide instead by the number of observations less one (N - 1). Base R has a function for variance, “var()”, but we can also calculate it manually.

var(fake_core$Cyperaceae.undiff.)
## [1] 105.164
var_cyp = mean((fake_core$Cyperaceae.undiff. - mean(fake_core$Cyperaceae.undiff.))^2)

#Perhaps R calculates variance for a sample and not a population...

var_cyp = sum((fake_core$Cyperaceae.undiff. - mean(fake_core$Cyperaceae.undiff.))^2)/
  (length(fake_core$Cyperaceae.undiff.)-1) #Note that we're taking N - 1 here. 

var(fake_core$Cyperaceae.undiff.)
## [1] 105.164
var_cyp
## [1] 105.164

In this case, we’ve finally gotten identical values when compared with the base R function. But what does variance mean and how do we interpret it? R lets us generate a lot of fake and specifically structured data. Let’s look at some general features of variance by looking at systematically modified input.

my_df = sapply(1:10, function(x){ #Using function to create progressively larger runs of numbers that are equally spaced.
    seq(x, x*100, x)
})

apply(my_df, 2, var)
##  [1]   841.6667  3366.6667  7575.0000 13466.6667 21041.6667 30300.0000
##  [7] 41241.6667 53866.6667 68175.0000 84166.6667
plot(apply(my_df, 2, var), type = "o", pch = 21, bg = "orange")

Naturally, as the spacing between values gets larger, the variance goes up. But while the spacing (and hence the range) grows linearly, the variance does not: it grows with the square of the spacing. Let’s look at a longer run of spacings and see how variance and the overall spread of the data interact.
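
The squared relationship is easy to check directly: each column above is x * (1:100), and var(c * x) equals c^2 * var(x).

```r
# Variance scales with the SQUARE of a multiplicative constant.
base_var <- var(1:100)

var(seq(2, 200, 2)) / base_var  # 4: doubling the spacing quadruples the variance
var(seq(3, 300, 3)) / base_var  # 9: tripling multiplies it by nine
```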

my_df = sapply(1:100, function(x){ #Using function to create progressively larger runs of numbers that are equally spaced.
    seq(x, x*100, x)
})

apply(my_df, 2, var)
##   [1]     841.6667    3366.6667    7575.0000   13466.6667   21041.6667
##   [6]   30300.0000   41241.6667   53866.6667   68175.0000   84166.6667
##  [11]  101841.6667  121200.0000  142241.6667  164966.6667  189375.0000
##  [16]  215466.6667  243241.6667  272700.0000  303841.6667  336666.6667
##  [21]  371175.0000  407366.6667  445241.6667  484800.0000  526041.6667
##  [26]  568966.6667  613575.0000  659866.6667  707841.6667  757500.0000
##  [31]  808841.6667  861866.6667  916575.0000  972966.6667 1031041.6667
##  [36] 1090800.0000 1152241.6667 1215366.6667 1280175.0000 1346666.6667
##  [41] 1414841.6667 1484700.0000 1556241.6667 1629466.6667 1704375.0000
##  [46] 1780966.6667 1859241.6667 1939200.0000 2020841.6667 2104166.6667
##  [51] 2189175.0000 2275866.6667 2364241.6667 2454300.0000 2546041.6667
##  [56] 2639466.6667 2734575.0000 2831366.6667 2929841.6667 3030000.0000
##  [61] 3131841.6667 3235366.6667 3340575.0000 3447466.6667 3556041.6667
##  [66] 3666300.0000 3778241.6667 3891866.6667 4007175.0000 4124166.6667
##  [71] 4242841.6667 4363200.0000 4485241.6667 4608966.6667 4734375.0000
##  [76] 4861466.6667 4990241.6667 5120700.0000 5252841.6667 5386666.6667
##  [81] 5522175.0000 5659366.6667 5798241.6667 5938800.0000 6081041.6667
##  [86] 6224966.6667 6370575.0000 6517866.6667 6666841.6667 6817500.0000
##  [91] 6969841.6667 7123866.6667 7279575.0000 7436966.6667 7596041.6667
##  [96] 7756800.0000 7919241.6667 8083366.6667 8249175.0000 8416666.6667
plot(apply(my_df, 2, var),
     type = "o",
     pch = 21, 
     bg = "orange")

lines(apply(my_df, 2, mad)*1000) #Here we're adding the MAD values for the same dataset, multiplied by 1000 to make them visible.

Here, we see that the relationship between variance and the overall dispersion becomes linear after a total dispersion of about 60,000, which gives us a variance of about 3 million. So, we see that populations which are structurally identical but which vary in their total quantity are going to have different variances. Specifically, the larger the total counts, the more variance we expect to find. We can also see that the median absolute deviation overestimates variance up to about 40,000 and then continually underestimates it afterwards. We will explore more about variance and population structure (i.e. the composite frequency distribution of variables in a population) once we have mastered some more univariate statistics.

This brings us to the standard deviation, which is derived directly from the variance by taking its square root. Recall that our estimation of variance uses the squared differences from the mean to avoid negative values; taking the square root returns the measure of spread to the original units of our observations.
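
In code the relationship is a one-liner; a toy vector stands in for a taxon column here:

```r
# The standard deviation is the square root of the variance, which puts
# the measure of spread back into the same units as the observations.
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

var(v)        # 32/7, in squared units
sqrt(var(v))  # back in the original units
sd(v)         # identical to sqrt(var(v))
```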

We can also verify this with some population thinking using the function “rnorm()” which lets us create a normally-distributed sample using arguments for the mean and standard deviation.

my_df = sapply(1:1000, function(x){
  rnorm(100, mean = x*10, sd = 10)
})

apply(my_df, 2, var)
##    [1]  86.43602  81.04623  93.41824  95.68065  95.66001  99.99769 113.10026
##    [8]  81.28159 128.71530 112.53739 104.25615  98.16949 100.96653 110.14285
## ... (output truncated: all 1000 variances scatter around 100, the square of the sd of 10 we requested)
##  [995] 123.06861 101.03731  94.44056  89.60087 108.01350 118.78849
plot(apply(my_df, 2, var), type = "o", pch = 21, bg = "orange")

We can learn a bit more by creating longer lists of repeating numbers, which will show us some of the impacts of sample size. Below, we make a short list of numbers with a mean and median of 9, then we repeat that list of numbers 100 times, taking the variance and median absolute deviation. The code below gives us three plots: one showing variance/MAD across the 100 repetitions of the number string; one showing a histogram of this data, and another showing frequency distributions as displayed using a combination of “plot()” and “table()” methods.

test_range = 5:13 #The integers 5 through 13; mean and median are both 9

plot_var = sapply(1:100, function(x){
  var(rep(test_range, x))
})

plot_mad = sapply(1:100, function(x){
  mad(rep(test_range, x))
})

par(mfrow = c(1, 3))

plot(plot_var, ylim = c(0,10), pch = 21, bg = "orange")

lines(plot_mad)

hist(rep(test_range, 100))

plot(table(rep(test_range,100)))

par(mfrow = c(1, 1))

Regarding variance, we see a minor impact of total sample size: because the sample variance uses an n − 1 denominator, it shrinks slightly as the vector is repeated, quickly leveling off at an asymptote (the population variance) after three or four doublings. The distribution of these data is uniform, and we see that increasing the number of observations (so long as they’re the same numbers) does not change our estimates of the mean or median. Will this hold for a set of numbers where the frequency distribution isn’t even?
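We can check where that asymptote lies. R’s “var()” uses an n − 1 denominator, so repeating the same values many times drives it down toward the population variance (the plain mean of the squared deviations). A quick sketch using the same “test_range” values:

```r
test_range <- 5:13   #Same values as above

var(test_range)      #Sample variance (n - 1 denominator): 7.5

#Population variance: the mean squared deviation from the mean
mean((test_range - mean(test_range))^2)   #60/9, about 6.67

var(rep(test_range, 100))   #With n = 900, already very close to 6.67
```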

norm_range = #Breaking the vector across lines (<shift+enter>) already sketches a visualization of the counts...
    c(5, 5,
      6, 6, 6,
      7, 7, 7, 7,
      8, 8, 8, 8, 8,
      9, 9, 9, 9, 9, 9,
      10, 10, 10, 10, 10,
      11, 11, 11, 11,
      12, 12, 12,
      13, 13)

plot_var = sapply(1:100, function(x){
  var(rep(norm_range, x))
})

plot_mad = sapply(1:100, function(x){
  mad(rep(norm_range, x))
})

par(mfrow = c(1, 2))

plot(plot_var, ylim = c(floor(min(plot_mad)) - 0.5, ceiling(max(plot_var)) +0.5), pch = 21, bg = "orange")

lines(plot_mad)

plot(table(rep(norm_range,100)))

par(mfrow = c(1, 1))

Let’s try this with a v-shaped distribution.

v_range = c(
      5, 5, 5, 5, 5, 5,
      6, 6, 6, 6, 6, 
      7, 7, 7, 7, 
      8, 8, 8,
      9, 9, 
      10, 10, 10, 
      11, 11, 11, 11,
      12, 12, 12, 12, 12,
      13, 13, 13, 13, 13, 13)

plot_var = sapply(1:100, function(x){
  var(rep(v_range, x))
})

plot_mad = sapply(1:100, function(x){
  mad(rep(v_range, x))
})

par(mfrow = c(1, 2))

plot(plot_var, ylim = c(floor(min(plot_mad)) - 0.5, ceiling(max(plot_var)) +0.5), pch = 21, bg = "orange")

lines(plot_mad)

plot(table(rep(v_range,100)))

par(mfrow = c(1, 1))

The replication structure we’re using here will serve us in the future. Let’s codify the repetitions and plotting into a function to save us all this copy-pasting.

stat_lines = function(x, #Creating a function called "stat_lines"; it expects a vector of data "x" and the following arguments
                      times = length(x), #Defaults to "length(x)", so calling this won't always be necessary. But we can supply our own number!
                      title = NA){ #Setting up a title object, defaulting to NA
  
  plot_u = sapply(1:times, function(y){ #Here's that fast custom function that creates means for the repeated set of numbers.
    mean(rep(x, y))
  })
  
  plot_med = sapply(1:times, function(y){ #Same, but for the median.
    median(rep(x, y))
  })
  
  plot_var = sapply(1:times, function(y){ #Same, but for variance.
    var(rep(x, y))
  })
  
  plot_mad = sapply(1:times, function(y){ #Same, but for median absolute deviation
    mad(rep(x, y))
  })
  
  stat_labels = c("mean", "median", "variance", "median abs. dev.")
  
  plot_stats = data.frame(plot_u, plot_med, plot_var, plot_mad)
  
  plot(0, 0, xlim = c(0, times), ylim = c(0, max(plot_stats) + 1), pch = NA, xlab = "N repeats", ylab = "value", main = paste0(title, " Repeated Statistics"))
  
  for(i in 1:ncol(plot_stats)){
    points(plot_stats[, i], type = "o", pch = 21, lty = i, bg = i, cex = 0.7)
    text(times*0.05+(i*(0.2*times)), min(plot_stats[, i])+(0.04*max(plot_stats)), labels = stat_labels[i])
  }
  
}

stat_lines(v_range, title = "V-Shaped Range")

This function makes the plotting much easier, allowing us to make quick comparisons between our two small datasets, which the “times” argument enlarges via “rep()”.

par(mfrow = c(1,2))

stat_lines(v_range, title = "V-Shaped Range", times = 100)

stat_lines(norm_range, title = "Normal Range", times = 100)

par(mfrow = c(1,1))

We can apply this same approach to some of the fake core data.

par(mfrow = c(2,2))

stat_lines(fake_core$Cyperaceae.undiff., title = "Cyperaceae")
stat_lines(fake_core$Typha, title = "Typha")
stat_lines(fake_core$Alchornea, title = "Alchornea")
stat_lines(fake_core$Guibourtia.demeusei, title = "Guibourtia")

par(mfrow = c(1,1))

What all of this demonstrates is that:

  1. Variables with different frequency distributions may have the same mean and/or median value.
  2. Variance grows quadratically as the spread of the data increases, because deviations from the mean are squared before being averaged.

Thus, to bring our measure of spread back to the original units of the data, we take the square root of the variance, giving the standard deviation.
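To see why the square root is the right fix, note what happens when we stretch the data: doubling every value quadruples the variance but only doubles the standard deviation (and the MAD), keeping those two measures in the same units as the data. A quick sketch:

```r
x <- 5:13

var(2 * x) / var(x)   #4: variance scales with the square of the spread
sd(2 * x) / sd(x)     #2: standard deviation scales linearly, like the data
mad(2 * x) / mad(x)   #2: as does the median absolute deviation
```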

sd(fake_core$Cyperaceae.undiff.)
## [1] 10.25495
sd_cyp = sqrt(var(fake_core$Cyperaceae.undiff.))

sd_cyp
## [1] 10.25495

Let’s see how it performs when we systematically manipulate the data again.

par(mfrow = c(1, 2))

my_df = sapply(1:10, function(x){ #Using sapply() to create 100-number sequences with progressively wider, but always equal, spacing.
    seq(x, x*100, x)
})

#apply(my_df, 2, sd)

plot(apply(my_df, 2, sd), type = "o", pch = 21, bg = "orange")
lines(apply(my_df, 2, mad))

my_df = sapply(1:1000, function(x){
  seq(x, x*100, x)
})

plot(apply(my_df, 2, sd), type = "o", pch = 21, bg = "orange")
lines(apply(my_df, 2, mad))

par(mfrow = c(1, 1))

Let’s add standard deviations to our basic plots and see where they fall.

stat_lines = function(x, #Creating a function called "stat_lines"; it expects a vector of data "x" and the following arguments
                      times = length(x), #Defaults to "length(x)", so calling this won't always be necessary. But we can supply our own number!
                      title = NA){ #Setting up a title object, defaulting to NA
  
  plot_u = sapply(1:times, function(y){ #Here's that fast custom function that creates means for the repeated set of numbers.
    mean(rep(x, y))
  })
  
  plot_med = sapply(1:times, function(y){ #Same, but for the median.
    median(rep(x, y))
  })
  
  plot_var = sapply(1:times, function(y){ #Same, but for variance.
    var(rep(x, y))
  })
  
  plot_mad = sapply(1:times, function(y){ #Same, but for median absolute deviation
    mad(rep(x, y))
  })
  
  plot_sd = sapply(1:times, function(y){
    sd(rep(x, y))
  })
  
  stat_labels = c("mean", "median", "variance", "median abs. dev.", "SD") #Because we automate everything below, all we have to do is modify the names.
  
  plot_stats = data.frame(plot_u, plot_med, plot_var, plot_mad, plot_sd) #We're using the plot_stats dataframe below, so we modify the entry here.
  
  plot(0, 0, xlim = c(0, times), ylim = c(0, max(plot_stats) + 1), pch = NA, xlab = "N repeats", ylab = "value", main = paste0(title, " Repeated Statistics"))
  
  for(i in 1:ncol(plot_stats)){
    points(plot_stats[, i], type = "o", pch = 21, lty = i, bg = i, cex = 0.7)
    text(times*0.05+(i*(0.2*times)), min(plot_stats[, i])+(0.04*max(plot_stats)), labels = stat_labels[i])
  }
  
}

Let’s plot our selected taxa again.

par(mfrow = c(2,2))

stat_lines(fake_core$Cyperaceae.undiff., title = "Cyperaceae")
stat_lines(fake_core$Typha, title = "Typha")
stat_lines(fake_core$Alchornea, title = "Alchornea")
stat_lines(fake_core$Guibourtia.demeusei, title = "Guibourtia")

par(mfrow = c(1,1))

4.1 Box-and-Whisker Plots

These are useful plots for showing us the basic contours of our data.

par(mar = c(5, 6, 4, 2) + 0.1)

boxplot(fake_core, 
        horizontal = TRUE, 
        las = 2, 
        cex.axis = 0.5)

par(mar = c(5, 4, 4, 2) + 0.1)
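For reference, each box spans the lower to upper hinge (roughly the first and third quartiles) with a bar at the median, and the whiskers reach the most extreme points within 1.5 times the hinge spread (R’s default “range = 1.5”); anything beyond is drawn as an outlier point. We can recover those values directly; the vector below is a made-up stand-in rather than a fake_core column:

```r
x <- c(2, 4, 4, 5, 6, 7, 9, 30)   #Stand-in data; 30 is an obvious outlier

fivenum(x)            #Minimum, lower hinge, median, upper hinge, maximum

boxplot.stats(x)$out  #Points beyond 1.5 * (hinge spread): flags the 30
```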

4.2 Kernel Densities and Violin Plots

There are some more useful ways to visualize our data, however. We can use kernel density plots to look at overall distributions of data in a more detailed way. This also helps us get a sense of how normally distributed the data actually are.

dens_plot = lapply(fake_core, function(x){
  density(x)
})

par(mar = c(5, 6, 4, 2) + 0.1)

boxplot(fake_core, 
        horizontal = TRUE, 
        las = 2, 
        cex.axis = 0.5,
        main = "Boxplots and Kernel Densities")

for(i in 1:length(dens_plot)){
  ylines = (dens_plot[[i]][['y']]/max(dens_plot[[i]][['y']]))*0.5
  lines(dens_plot[[i]][['x']], ylines + i, lty = 3, col = "blue")
}

par(mar = c(5, 4, 4, 2) + 0.1)
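One caveat with “density()”: the smoothness of each curve depends on the bandwidth, which R picks automatically (via “bw.nrd0” by default), and the picture can change a lot if we override it. A quick sketch with simulated stand-in data rather than the fake_core columns:

```r
set.seed(42)                           #Simulated stand-in data
x <- rnorm(200, mean = 100, sd = 10)

d_default <- density(x)                #Automatic bandwidth
d_narrow <- density(x, bw = 1)         #Undersmoothed: wiggly
d_wide <- density(x, bw = 20)          #Oversmoothed: flattened out

plot(d_default, main = "Bandwidth Comparison")
lines(d_narrow, lty = 2, col = "red")
lines(d_wide, lty = 3, col = "blue")
```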

You could also use the “vioplot()” function, which uses these densities to create box-plot-like figures whose margins are polygons shaped like the densities.

par(mar = c(5, 6, 4, 2) + 0.1)

vioplot::vioplot(fake_core, 
        horizontal = TRUE, 
        las = 2, 
        cex.axis = 0.5,
        col = "gold")

for(i in 1:length(dens_plot)){
  ylines = (dens_plot[[i]][['y']]/max(dens_plot[[i]][['y']]))*0.5
  lines(dens_plot[[i]][['x']], ylines + i, lty = 3, col = "blue")
}

par(mar = c(5, 4, 4, 2) + 0.1)